Add support for filecrushing on Elastic MapReduce #2

Open
alexanderdean wants to merge 4 commits into edwardcapriolo:master from snowplow:master

@alexanderdean

Work-in-progress PR - do not pull yet

Hi @edwardcapriolo - this is an open pull request to add support for using filecrush on EMR.

There are three main things to fix:

  1. Instantiating the right type of FileSystem
  2. Fixing the location of tmpDir - I think we should be referencing "${hadoop.tmp.dir}" rather than a raw new Path("tmp/crush-" + UUID.randomUUID());
  3. Replacing the fs.makeQualified(dir).toUri().getPath() pattern with something that doesn't strip important S3 bucket information

Item 1 is done, see PR. Item 2 is doable. Item 3 is a bit harder - I am working through this for EMR, but might need some help from you to make sure my changes don't break filecrush on standard HDFS.
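
To illustrate the third point, here is a minimal sketch of why taking only the path component loses the bucket. It uses plain java.net.URI as a stand-in for the URI that Hadoop's Path.toUri() returns; the class name, method names, and example locations are hypothetical and are not taken from the filecrush code.

```java
import java.net.URI;

// Illustrative only: plain java.net.URI stands in for Hadoop's Path.toUri().
// The class name, method names, and example locations are hypothetical.
class BucketPathDemo {

    // Mirrors the ".toUri().getPath()" step: keeps only the path component,
    // so the scheme and the authority (the S3 bucket) are dropped.
    static String pathOnly(String location) {
        return URI.create(location).getPath();
    }

    // Keeps the whole URI, so the scheme and bucket survive.
    static String fullLocation(String location) {
        return URI.create(location).toString();
    }

    public static void main(String[] args) {
        String s3 = "s3://example-bucket/logs/2013";
        System.out.println(pathOnly(s3));      // /logs/2013
        System.out.println(fullLocation(s3));  // s3://example-bucket/logs/2013
    }
}
```

On a single default HDFS the bare path is usually enough, because the default filesystem supplies the missing scheme and authority when the path is resolved again. On S3 the bucket is only present in the authority, so it has to be carried along explicitly.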

Hoping this is the start of a collaboration! We're really excited about filecrush here at Snowplow.

@edwardcapriolo
Owner

It all looks good so far. Just let me know when you want me to merge.

@alexanderdean
Author

We ended up not using this library in the end. :-) You can merge as-is if you like, or close. I'll delete our fork in a few days.
